Online Expectation Maximization for Reinforcement Learning in POMDPs
نویسندگان
چکیده
We present online nested expectation maximization for model-free reinforcement learning in a POMDP. The algorithm evaluates the policy only in the current learning episode, discarding the episode after the evaluation and memorizing the sufficient statistic, from which the policy is computed in closedform. As a result, the online algorithm has a time complexity O ( n ) and a memory complexity O(1), compared to O ( n ) and O(n) for the corresponding batch-mode algorithm, where n is the number of learning episodes. The online algorithm, which has a provable convergence, is demonstrated on five benchmark POMDP problems.
منابع مشابه
Episodic Reinforcement Learning by Logistic Reward-Weighted Regression
It has been a long-standing goal in the adaptive control community to reduce the generically difficult, general reinforcement learning (RL) problem to simpler problems solvable by supervised learning. While this approach is today’s standard for value function-based methods, fewer approaches are known that apply similar reductions to policy search methods. Recently, it has been shown that immedi...
متن کاملEfficient Planning for Factored Infinite-Horizon DEC-POMDPs
Decentralized partially observable Markov decision processes (DEC-POMDPs) are used to plan policies for multiple agents that must maximize a joint reward function but do not communicate with each other. The agents act under uncertainty about each other and the environment. This planning task arises in optimization of wireless networks, and other scenarios where communication between agents is r...
متن کاملFactorized Asymptotic Bayesian Policy Search for POMDPs
This paper proposes a novel direct policy search (DPS) method with model selection for partially observed Markov decision processes (POMDPs). DPSs have been standard for learning POMDPs due to their computational efficiency and natural ability to maximize total rewards. An important open challenge for the best use of DPS methods is model selection, i.e., determination of the proper dimensionali...
متن کاملExpectation Maximization for Weakly Labeled Data
We call data weakly labeled if it has no exact label but rather a numerical indication of correctness of the label “guessed” by the learning algorithm a situation commonly encountered in problems of reinforcement learning. The term emphasizes similarities of our approach to the known techniques of solving unsupervised and transductive problems. In this paper we present an on-line algorithm that...
متن کاملLearning for Decentralized Control of Multiagent Systems in Large, Partially-Observable Stochastic Environments
Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general framework for multiagent sequential decision-making under uncertainty. Although Dec-POMDPs are typically intractable to solve for real-world problems, recent research on macro-actions (i.e., temporally-extended actions) has significantly increased the size of problems that can be solved. However, current...
متن کامل